Your project should incorporate one or both of the two main themes of this course: network analysis and text processing. You need to show all of your work in a coherent workflow, and in a reproducible format, such as an IPython Notebook or an R Markdown document. If you are building a model or models, explain how you evaluate the “goodness” of the chosen model and parameters.
We’ll schedule a short presentation for each team, either in our last scheduled meet-up or in additional office hours to be scheduled during the last week of classes.
You may work in a team of up to three people. Each project team member is responsible for understanding and being able to explain all of the submitted project code. Remember that you can take work that you find elsewhere as a base to build on, but you need to acknowledge the source, so that I base your grade on what you contributed, not on what you started with!
We chose a dataset that we found on the Data Is Plural — Structured Archive that consists of 56,498 recipes from various cuisines that were scraped from 3 popular recipe websites.
Description from Data Is Plural
For their 2011 paper, “Flavor network and the principles of food pairing,” four scientists analyzed 56,498 recipes downloaded from three websites — allrecipes.com, epicurious.com, and menupan.com. To support their findings, the authors published two datasets. One names the cuisine and ingredients for each recipe. The other dataset counts how often any two ingredients appeared in the same recipe. (Parmesan cheese and beef appeared together 93 times; starfruit and Algerian geranium oil just once.) Related: “food2vec – Augmented cooking with machine intelligence,” published last month. h/t Rob Barry.
The original research article is: Flavor network and the principles of food pairing
The related article cited above is: food2vec – Augmented cooking with machine intelligence
The data is easily downloaded in CSV format from the Electronic supplementary material section of the Flavor network and the principles of food pairing research paper webpage.
The data downloads consist of the following two files:
Structure of the srep00196-s2 dataset:
Each row lists a pair of ingredients in the first two columns and, in the third column, the number of times that pair appears in the same recipe across all recipes and cuisines in the dataset. This file carries no information about the cuisine of each pairing.
Additionally, there is some confusion about what this data actually represents, since a different source, Recipes for learning, suggests that the third column in fact gives the number of flavor compounds that the two ingredients share. For both reasons we decided not to use this file and instead built our own counts of ingredient pairs, grouped by cuisine, from the other file.
Structure of the srep00196-s3 dataset:
Significant data manipulation was necessary to reshape and analyze this dataset both as a text and as a network.
import pandas as pd
import numpy as np
from IPython.display import Markdown
import nltk
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from nltk.tokenize import word_tokenize
import networkx as nx
from networkx.algorithms import bipartite as bi
from scipy import stats
import math
import random
random.seed(250)
import matplotlib.pyplot as plt
%matplotlib inline
# jupyter setup
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
plt.rcParams["figure.figsize"] = (15,12)
file_name ='https://raw.githubusercontent.com/betsyrosalen/DATA_620_Web_Analytics/master/Final_Project_Data/srep00196-s3.csv'
columns = ['Cuisine', 'ingred1','ingred2','ingred3','ingred4','ingred5','ingred6','ingred7','ingred8','ingred9',
'ingred10','ingred11','ingred12','ingred13','ingred14','ingred15','ingred16','ingred17','ingred18',
'ingred19','ingred20','ingred21','ingred22','ingred23','ingred24','ingred25','ingred26','ingred27',
'ingred28','ingred29','ingred30','ingred31','ingred32']
recipes = pd.read_csv(file_name, header=None, skiprows=4, names=columns, encoding = 'utf-8',)
recipes.head()
print("There are "+str(recipes.shape[0])+ " recipes with a maximum of "+str(recipes.shape[1]-1)+" ingredients each.")
First we have to figure out how to get all the ingredients into one string that can be written to a file to create each text in our corpus. After some trial and error we got the following code to do what we needed.
for index, r in recipes.head().iterrows():
    string = ""
    for col in r[1:32]:
        if type(col) == str:
            string = string+str(col)+" "
    print(string)
Now we can use a function we found on StackOverflow and modify it to create our corpus. I am putting this code in a markdown cell so that it doesn't run again each time we run the notebook. Copy the code into a code cell or change the cell to a code cell to run it to create your text files if you want to reproduce this analysis.
def CreateCorpusFromDataFrame(corpusfolder,df):
    # Write one text file per recipe, named <cuisine>_recipe<index>.txt
    for index, r in df.iterrows():
        recipe_id = 'recipe'+str(index)
        body = ""
        # note: r[1:32] spans positions 1 to 31, i.e. ingred1 through ingred31
        for col in r[1:32]:
            if type(col) == str:
                body = body+str(col)+" "
        cuisine = r['Cuisine']
        fname = str(cuisine)+'_'+str(recipe_id)+'.txt'
        # the with-block closes the file even if the write fails
        with open(corpusfolder+'/'+fname,'w') as corpusfile:
            corpusfile.write(str(body))

CreateCorpusFromDataFrame('Final_Project_corpusfolder',recipes)
Finally we can run the following code to create our corpus in NLTK.
corpus = CategorizedPlaintextCorpusReader('Final_Project_corpusfolder/',r'.*', cat_pattern=r'(.*)_.*')
corpus.fileids()[1:15]
corpus.categories()
corpus.words(categories='African')
corpus.sents(categories=['EasternEuropean', 'NorthernEuropean', 'SouthernEuropean', 'WesternEuropean'])
recipes['Cuisine'].value_counts()
North American recipes make up a disproportionate share of the dataset, with 41,524 of the 56,498 recipes, and 6 of the 11 cuisines are seriously underrepresented, with only 250 to 645 recipes each. The following analysis of ingredients across the whole dataset will therefore be skewed toward North American cuisine. Future analysis with randomly chosen, equal-sized subsets of each cuisine could give a more balanced overall picture. Choosing subsets sized in proportion to population might also tell us something about the relative global importance of recipe ingredients.
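The equal-sized subsets suggested above could be drawn roughly as follows. This is only a sketch on made-up data: the `toy` dataframe and the name `n_per_cuisine` are illustrative and not part of the notebook, and it assumes a pandas version that provides `DataFrameGroupBy.sample` (1.1 or later).

```python
import pandas as pd

# Toy stand-in for the `recipes` dataframe (values are made up for illustration).
toy = pd.DataFrame({
    "Cuisine": ["NorthAmerican"] * 8 + ["African"] * 3 + ["EastAsian"] * 4,
    "ingred1": list("abcdefghijklmno"),
})

# Use the size of the smallest cuisine so every group can be sampled without replacement.
n_per_cuisine = int(toy["Cuisine"].value_counts().min())

# Draw the same number of recipes from every cuisine; random_state makes it reproducible.
balanced = toy.groupby("Cuisine").sample(n=n_per_cuisine, random_state=250)

print(balanced["Cuisine"].value_counts())
```

Sampling down to the smallest cuisine discards most of the North American recipes, so repeating the draw with several seeds would show how stable any conclusion is.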
# Add #of_ingred column
recipes['#of_ingred'] = 32-recipes.isnull().sum(axis=1)
# Plot distribution
recipes['#of_ingred'].hist(bins=32, figsize=(10,5));
round(recipes['#of_ingred'].mean(), 2)
As mentioned before, our recipes have from 1 to 32 ingredients each. The distribution of the number of ingredients per recipe is slightly right-skewed. Most recipes have between 2 and 15 ingredients, and the overall average is 8.22 ingredients per recipe.
# Average only the ingredient-count column; averaging every column fails on text columns in recent pandas
num_ing = recipes.groupby('Cuisine')['#of_ingred'].mean().sort_values(ascending=False).reset_index()
num_ing
num_ing.plot.bar(x='Cuisine', figsize=(10,3));
Southeast Asian recipes have the most ingredients per recipe on average, at 11.32. African and South Asian recipes also average more than 10 ingredients. Northern European recipes have the lowest average at 6.82, a little more than half the Southeast Asian average, and North American recipes have the second lowest at 7.96.
recipes['#of_ingred'].hist(by=recipes['Cuisine'], bins=12, figsize=(15,10));
We can see the differences in the distributions of ingredients per cuisine above. The peak of the distribution for all European cuisines, as well as for North American cuisine, clearly falls under 10 ingredients, while the peak for Southeast Asian, African, South Asian, and Latin American cuisines falls at about 10 ingredients. The range for Eastern European, Southern European, and Northern European cuisines extends to 30, while the rest top out at 25 or less.
# Add recipeID (recipe#) column
recipes['recipe#'] = recipes.index+1
ingredient_list = pd.melt(recipes, id_vars=['Cuisine','recipe#'], value_vars=columns[1:]).dropna()
ingredient_list.rename(columns={"variable":"ingred#", "value":"ingredient"}, inplace=True)
#ingredient_list.head()
#ingredient_list.tail()
len(ingredient_list)
ingredient_counts = pd.DataFrame(ingredient_list.ingredient.value_counts())
ingredient_counts.shape
ingredient_counts.head(10)
There are 464,407 total ingredients included in all 56,498 recipes and 381 unique ingredients. Eggs are the most popular ingredient, followed by wheat, butter, onion and garlic.
ingredient_counts.tail(17)
There are 16 ingredients that show up only once in all recipes: mate, jamaican rum, geranium, roasted nut, beech, jasmine tea, muscat grape, angelica, durian, strawberry jam, sturgeon caviar, roasted pecan, pelargonium, lilac flower oil, roasted hazelnut, and emmental cheese. This is not surprising given how exotic many of the ingredients in the list are.
len(corpus.words())
len(corpus.sents())
ingredients_sorted = sorted(list(corpus.words()))
len(set(ingredients_sorted))
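Our corpus counts agree closely with the numbers we got using pandas: 464,405 total ingredients in 56,498 recipes with 381 unique ingredients, against 464,407 total ingredients from the dataframe. One possible cause of the two-ingredient gap is that the corpus-building loop slices `r[1:32]`, which covers only the first 31 ingredient columns and so drops any value in `ingred32`; we have not confirmed that this accounts for the difference. Either way, the gap is too small to affect our analysis.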
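To see how a slice over the ingredient columns can fall one column short, here is a small sketch on a hypothetical single row laid out like `recipes` (cuisine first, then `ingred1` to `ingred32`). The `row`, `partial`, and `full` names are illustrative only, and the sketch uses `.iloc` to make the positional slicing explicit.

```python
import pandas as pd

# Hypothetical one-row example with the same layout as `recipes`:
# position 0 is the cuisine, positions 1..32 are the ingredient columns.
labels = ["Cuisine"] + ["ingred" + str(k) for k in range(1, 33)]
row = pd.Series(["African"] + ["item" + str(k) for k in range(1, 33)], index=labels)

partial = row.iloc[1:32]   # positions 1..31: stops at ingred31
full = row.iloc[1:33]      # positions 1..32: includes ingred32

print(len(partial), len(full))
print(partial.index[-1], full.index[-1])
```

Only recipes that actually use all 32 ingredient slots would be affected, which would fit a very small discrepancy, but that remains an assumption until the 32-ingredient recipes are counted.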
fdist = nltk.FreqDist(corpus.words())
def top_half(ingredients, fdist):
    # Count how many of the most frequent ingredients are needed to exceed half of all tokens
    tw=len(ingredients)
    tcount=0
    wcount=0
    for word, count in fdist.most_common():
        tcount=tcount+count
        wcount=wcount+1
        if tcount>(tw/2):
            return(wcount)
top_half(ingredients_sorted, fdist)
The following 22 ingredients represent half of the total ingredients in the recipes.
pd.DataFrame(fdist.most_common()[:22], columns=['ingredient', 'count'])
plt.figure(figsize=(15,7))
fdist.plot(22, cumulative = True);
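The "half of all ingredients" count can be checked by hand on a small example. The sketch below uses `collections.Counter` in place of `nltk.FreqDist` so that it runs without the corpus; `top_half_count` and `toy_tokens` are illustrative names, not part of the notebook.

```python
from collections import Counter

def top_half_count(tokens):
    """Return how many of the most frequent items cover more than half of all tokens."""
    counts = Counter(tokens)
    half = len(tokens) / 2
    running = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running > half:
            return rank
    return 0  # only reached for an empty list

# 10 tokens in total: egg alone covers 4, egg plus wheat covers 7, which exceeds 5
toy_tokens = ["egg"] * 4 + ["wheat"] * 3 + ["butter"] * 2 + ["onion"]
print(top_half_count(toy_tokens))  # 2
```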
The top 50 ingredients each show up 2500 times or more.
plt.figure(figsize=(17,5))
fdist.plot(50, );
2500/56498
cuisine_counts = pd.DataFrame(ingredient_list.groupby('Cuisine').ingredient.value_counts())
cuisine_counts.columns = ["count"]
cuisine_counts.groupby('Cuisine').head(1)
The top ingredient varies by cuisine. Surprisingly, the top ingredient overall, eggs, does not take the top slot in any individual cuisine. This may indicate that eggs are cross-culturally relevant as an ingredient while other common ingredients may be more culturally specific. Butter, the third most popular ingredient overall, takes the top spot for 4 cuisines in our list: Eastern European, North American, Northern European, and Western European.
cuisine_counts.groupby('Cuisine').head(5)
The same 4 cuisines, Eastern European, North American, Northern European, and Western European, all share the same top 5 ingredients: butter, eggs, wheat, milk/cream and onion. Onion, garlic and olive oil show up regularly in the other cuisines' top 5 lists, which is not surprising considering they rank 4th, 5th and 10th overall. Not surprisingly, eggs show up in the top 5 for 6 of the 11 cuisines.
cuisine_props = pd.DataFrame(ingredient_list.groupby('Cuisine').ingredient.value_counts(normalize=True)*100)
cuisine_props.columns = ["percent"]
cuisine_props.reset_index().groupby('Cuisine').head(1).sort_values(by=['percent'], ascending=False)
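The top ingredient in each cuisine accounts for between 5% and just over 7% of all ingredient listings in that cuisine, except in Northern European cuisine, whose top ingredient, butter, accounts for over 9%! Note that these percentages are shares of ingredient listings rather than shares of recipes, because `value_counts(normalize=True)` divides by the number of rows in `ingredient_list` for each cuisine.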
cuisine_props.sort_values(by=['percent'], ascending=False).head(20)
Looking at the top 20 ingredients as percentages of their respective cuisines shows a slightly different picture. No North American ingredients show up on the list at all, because even the most common North American ingredient is less frequent within its cuisine than some ingredients are in other cuisines. This suggests there may be more variety of ingredients between recipes in North American cooking. At the other end of the spectrum, Northern, Western, and Eastern European cuisines all have their top three ingredients in the list and Latin American has its top 4, indicating that there may be more commonality, or less variety, of ingredients between recipes in those cuisines.
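The percentages from `value_counts(normalize=True)` measure each ingredient's share of all ingredient listings in a cuisine. The share of recipes containing an ingredient is a different quantity. The sketch below computes both on made-up data shaped like `ingredient_list`; the `toy` frame and the `percent_of_recipes` column are illustrative names, not part of the notebook.

```python
import pandas as pd

# Made-up long-format data with the same columns as `ingredient_list`.
toy = pd.DataFrame({
    "Cuisine":    ["X"] * 6,
    "recipe#":    [1, 1, 1, 2, 2, 3],
    "ingredient": ["butter", "egg", "wheat", "butter", "egg", "butter"],
})

# Share of ingredient listings: butter is 3 of the 6 rows.
mention_share = toy.groupby("Cuisine")["ingredient"].value_counts(normalize=True) * 100

# Share of recipes: distinct recipes containing the ingredient / recipes in the cuisine.
n_recipes = toy.groupby("Cuisine")["recipe#"].nunique()
recipe_share = (
    toy.groupby(["Cuisine", "ingredient"])["recipe#"].nunique()
       .reset_index(name="recipes_with")
)
recipe_share["percent_of_recipes"] = (
    recipe_share["recipes_with"] / recipe_share["Cuisine"].map(n_recipes) * 100
)

butter_mentions = mention_share.loc[("X", "butter")]
butter_recipes = recipe_share.loc[
    recipe_share["ingredient"] == "butter", "percent_of_recipes"
].iloc[0]
print(butter_mentions, butter_recipes)
```

In this toy data butter is half of all listings but appears in every recipe, so the two measures can differ by a large factor.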
In order to see which ingredients most commonly go together for each cuisine and the similarities and differences between common pairings by cuisine we first need to reshape our data into a dataframe with all the pairings across all recipes.
# Collect one dataframe per pair of ingredient columns, then concatenate once
# (DataFrame.append was removed in pandas 2.0)
frames = []
for i in range(1,32):
    for j in range((i+1),33):
        temp=recipes.iloc[:,[0,i,j]]
        temp.columns=['Cuisine','ingred1','ingred2']
        frames.append(temp.dropna())
pairs = pd.concat(frames, ignore_index=True)
pairs.head()
pairs.shape
# Just double checking that there are no NA's
pairs = pairs.dropna()
pairs.shape
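One thing worth checking in a pairing table built from column positions is whether the same two ingredients can appear in both orders, for example (onion, garlic) in one recipe and (garlic, onion) in another. We do not know whether the source file lists ingredients in a consistent order, so the following is only a sketch of an order-independent count using the standard library. `toy_recipes` and `pair_counter` are illustrative names.

```python
from collections import Counter
from itertools import combinations

# Made-up recipes: (cuisine, list of ingredients in arbitrary order)
toy_recipes = [
    ("African", ["onion", "garlic", "cumin"]),
    ("African", ["garlic", "onion"]),
    ("EastAsian", ["rice", "soy_sauce"]),
]

pair_counter = Counter()
for cuisine_name, ingredients in toy_recipes:
    # sorted(set(...)) gives every unordered pair a single canonical spelling
    for a, b in combinations(sorted(set(ingredients)), 2):
        pair_counter[(cuisine_name, a, b)] += 1

print(pair_counter[("African", "garlic", "onion")])  # 2
```

If the column-based table and a canonical count like this one gave the same totals on the real data, ordering would not be a concern.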
There are 2,036,125 records, which we condensed by collapsing duplicate entries into one row each and adding a column for counts. Since we know that some cuisines are over- or underrepresented, we also added a column giving each pairing's share of all ingredient pairs counted for its cuisine, as a percentage. The resulting dataframe is below.
pair_counts = pairs.reset_index().groupby(['Cuisine','ingred1','ingred2'], as_index=False).count()
pair_counts.rename(columns={"index":"counts"}, inplace=True)
pair_counts['total'] = pair_counts.groupby('Cuisine').counts.transform('sum')
pair_counts['percent'] = pair_counts['counts']/pair_counts['total']*100
pair_counts.head()
pair_counts.shape
pair_counts.sort_values(by='counts', ascending=False, inplace=True)
pair_counts.head(10)
Out of all pairs, wheat and egg came in first place when sorted by total count. This is not surprising since North American cuisine is overrepresented in the dataset. Looking at the top 10 pairs of ingredients, it also seems the dataset may be heavily skewed toward desserts, or at least baking recipes.
pair_counts.sort_values(by='percent', ascending=False, inplace=True)
pair_counts.head(10)
Sorting instead by the percentage of each cuisine's total pairs that a pair represents gives a slightly different picture. We see a few savory pairs hit the top 10: tomato and onion, onion and cayenne, and olive oil and garlic.
We also see that the three pairings that were the top 3 for North America and overall are the top 3 for Northern European cuisine as well, where they account for about twice the share of pairs that they do in North American cuisine.
pair_counts.sort_values(by=['Cuisine','percent'], ascending=[True, False]).groupby('Cuisine').head(3)
In the following sections we analyze, for each cuisine, the top ingredient pairs via network graphs, and the ingredients that make up half of the cuisine's total ingredients, along with its top 50 ingredients, via text analysis.
Ingredient pairs were filtered by number of occurrences to keep approximately the top 80-90 pairs for each cuisine, which makes the visualizations easier to read.
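The notebook reaches roughly 80-90 pairs per cuisine by choosing a separate count threshold for each cuisine. As an alternative sketch (not the method used here), the top N rows per cuisine can be selected directly. `toy_pairs` and `top_n` are illustrative names and the data is made up.

```python
import pandas as pd

# Made-up pair counts shaped like `pair_counts`
toy_pairs = pd.DataFrame({
    "Cuisine": ["A", "A", "A", "B", "B", "B"],
    "ingred1": ["onion", "onion", "garlic", "wheat", "wheat", "egg"],
    "ingred2": ["garlic", "cumin", "cumin", "egg", "butter", "butter"],
    "counts":  [40, 25, 10, 900, 700, 650],
})

top_n = 2  # the notebook aims for roughly 80-90 per cuisine

# Sort once by count, then keep the first top_n rows within each cuisine
top_pairs = (
    toy_pairs.sort_values("counts", ascending=False)
             .groupby("Cuisine")
             .head(top_n)
)
print(top_pairs.sort_values(["Cuisine", "counts"], ascending=[True, False]))
```

This keeps the number of edges comparable across cuisines without tuning a threshold by hand, at the cost of cutting arbitrarily among tied counts.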
def setup(category):
    # Frequency distribution and sorted token list for one cuisine
    fdist = nltk.FreqDist(corpus.words(categories = category))
    ingredients_sorted = sorted(list(corpus.words(categories = category)))
    return(fdist, ingredients_sorted)

def cuisine(category, ingredients_sorted):
    # Summary sentence for one cuisine, rendered as Markdown
    ing = len(corpus.words(categories = category))
    rec = len(corpus.sents(categories = category))
    uni_ing = len(set(ingredients_sorted))
    mkdwn = Markdown("""The **{category}** recipes in our dataset include **{ing} total ingredients** listed in
**{rec} recipes** with **{uni_ing} unique ingredients**.""".format(category=category, ing=ing,
                                                                   rec=rec, uni_ing=uni_ing))
    return(mkdwn)

def half(category, ingredients_sorted, fdist):
    # Sentence stating how many ingredients cover half of the cuisine's tokens
    top = top_half(ingredients_sorted, fdist)
    mkdwn = Markdown("""The following **{top} ingredients represent half of the total ingredients**
in the **{category}** recipes.""".format(top=top, category=category))
    return(mkdwn)
pair_countsA=pair_counts[(pair_counts['counts']>30) & (pair_counts['Cuisine']=='African')]
pair_countsA.shape
A=nx.from_pandas_edgelist(pair_countsA, 'ingred1', 'ingred2', edge_attr='percent')
weights=[edata['percent']*15 for f,t,edata in A.edges(data=True)]
nx.draw_circular(A, with_labels=True, node_color="indianred", node_size=1000, font_size=12, font_weight='bold',
width=weights, edge_color="darksalmon", alpha=0.5)
The centrality of spices to African cooking really stands out in the graph above which includes: onion, garlic, cumin, turmeric, cilantro, cayenne, ginger, parsley, cinnamon and saffron.
Chicken, lamb and chicken broth are the only animal-based foods in the graph.
category = "African"
fdist, ingredients_sorted = setup(category)
cuisine(category, ingredients_sorted)
half(category, ingredients_sorted, fdist)
pd.DataFrame(fdist.most_common()[:16], columns=['ingredient', 'count'])
plt.figure(figsize=(15,4))
fdist.plot(16, cumulative = True);
plt.figure(figsize=(17,5))
fdist.plot(50, );
pair_countsEA=pair_counts[(pair_counts['counts']>200) & (pair_counts['Cuisine']=='EastAsian')]
pair_countsEA.shape
EA=nx.from_pandas_edgelist(pair_countsEA, 'ingred1', 'ingred2', edge_attr='percent')
weights=[edata['percent']*15 for f,t,edata in EA.edges(data=True)]
nx.draw_circular(EA, with_labels=True, node_color="chocolate", node_size=1000, font_size=12, font_weight='bold',
width=weights, edge_color="sandybrown", alpha=0.5)
East Asian cooking is clearly heavily influenced by rice, cayenne, ginger, soy sauce, scallions, garlic, and sesame oil.
Fish, eggs, and beef are the only animal-based foods in the graph.
category = "EastAsian"
fdist, ingredients_sorted = setup(category)
cuisine(category, ingredients_sorted)
half(category, ingredients_sorted, fdist)
pd.DataFrame(fdist.most_common()[:13], columns=['ingredient', 'count'])
plt.figure(figsize=(15,4))
fdist.plot(13, cumulative = True);
plt.figure(figsize=(17,5))
fdist.plot(50, );
pair_countsEE=pair_counts[(pair_counts['counts']>18) & (pair_counts['Cuisine']=='EasternEuropean')]
pair_countsEE.shape
EE=nx.from_pandas_edgelist(pair_countsEE, 'ingred1', 'ingred2', edge_attr='percent')
weights=[edata['percent']*15 for f,t,edata in EE.edges(data=True)]
nx.draw_circular(EE, with_labels=True, node_color="darkgoldenrod", node_size=1000, font_size=12, font_weight='bold',
width=weights, edge_color="goldenrod", alpha=0.5)
The centrality of wheat, butter, and eggs to Eastern European cooking is very evident in the thick triangle created by the edges between them in the graph above. The inclusion of yeast in the graph with edges connecting it to those same three ingredients plus milk indicates a lot of baked goods, possibly breads.
The animal-based foods and especially fats are prominent in the graph which includes: butter, eggs, milk, cream, beef, lard, bacon, and milk fat.
category = "EasternEuropean"
fdist, ingredients_sorted = setup(category)
cuisine(category, ingredients_sorted)
half(category, ingredients_sorted, fdist)
pd.DataFrame(fdist.most_common()[:16], columns=['ingredient', 'count'])
plt.figure(figsize=(15,4))
fdist.plot(16, cumulative = True);
plt.figure(figsize=(17,5))
fdist.plot(50, );
pair_countsLA=pair_counts[(pair_counts['counts']>220) & (pair_counts['Cuisine']=='LatinAmerican')]
pair_countsLA.shape
LA=nx.from_pandas_edgelist(pair_countsLA, 'ingred1', 'ingred2', edge_attr='percent')
weights=[edata['percent']*15 for f,t,edata in LA.edges(data=True)]
nx.draw_circular(LA, with_labels=True, node_color="olivedrab", node_size=1000, font_size=12, font_weight='bold',
width=weights, edge_color="darkkhaki", alpha=0.5)
Onion, garlic, cayenne and tomatoes are central to Latin American cooking and stand out in the graph above with thick edges connecting all of them to each other. Although corn is there it's surprising to me that it isn't more prominent, but this may be due to the lumping together of all Latin American cuisines into one broad category.
Animal-based foods, and especially cheeses, are more prominent in Latin American cuisine, with the inclusion of beef, cheese, eggs, cream, chicken, and cheddar cheese.
category = "LatinAmerican"
fdist, ingredients_sorted = setup(category)
cuisine(category, ingredients_sorted)
half(category, ingredients_sorted, fdist)
pd.DataFrame(fdist.most_common()[:14], columns=['ingredient', 'count'])
plt.figure(figsize=(15,4))
fdist.plot(14, cumulative = True);
plt.figure(figsize=(17,5))
fdist.plot(50, );
pair_countsME=pair_counts[(pair_counts['counts']>29) & (pair_counts['Cuisine']=='MiddleEastern')]
pair_countsME.shape
ME=nx.from_pandas_edgelist(pair_countsME, 'ingred1', 'ingred2', edge_attr='percent')
weights=[edata['percent']*15 for f,t,edata in ME.edges(data=True)]
nx.draw_circular(ME, with_labels=True, node_color="seagreen", node_size=1000, font_size=12, font_weight='bold',
width=weights, edge_color="mediumseagreen", alpha=0.5)
Olive oil, lemon juice, wheat, onion, garlic and eggs look like the most heavily connected foods in Middle Eastern cuisine.
Chicken, lamb, chicken broth, eggs, butter and cream are the only animal-based foods in the graph.
category = "MiddleEastern"
fdist, ingredients_sorted = setup(category)
cuisine(category, ingredients_sorted)
half(category, ingredients_sorted, fdist)
pd.DataFrame(fdist.most_common()[:19], columns=['ingredient', 'count'])
plt.figure(figsize=(15,4))
fdist.plot(19, cumulative = True);
plt.figure(figsize=(17,5))
fdist.plot(50, );
pair_countsNA=pair_counts[(pair_counts['counts']>1500) & (pair_counts['Cuisine']=='NorthAmerican')]
pair_countsNA.shape
NA=nx.from_pandas_edgelist(pair_countsNA, 'ingred1', 'ingred2', edge_attr='percent')
weights=[edata['percent']*15 for f,t,edata in NA.edges(data=True)]
nx.draw_circular(NA, with_labels=True, node_color="darkcyan", node_size=1000, font_size=12, font_weight='bold',
width=weights, edge_color="lightseagreen", alpha=0.5)
Like Eastern European cooking, North American cooking has wheat, butter, and eggs front and center, with the addition of milk. We can see two overlapping thick triangles created by the edges between them in the graph above. The inclusion of yeast here also indicates baked goods, possibly breads, but we also see vanilla, cocoa, and cane molasses, which may indicate sweets.
Milk, lard, eggs, cream, butter, and chicken are the only animal products.
category = "NorthAmerican"
fdist, ingredients_sorted = setup(category)
cuisine(category, ingredients_sorted)
half(category, ingredients_sorted, fdist)
pd.DataFrame(fdist.most_common()[:21], columns=['ingredient', 'count'])
plt.figure(figsize=(15,4))
fdist.plot(21, cumulative = True);
plt.figure(figsize=(17,5))
fdist.plot(50, );
pair_countsNE=pair_counts[(pair_counts['counts']>8) & (pair_counts['Cuisine']=='NorthernEuropean')]
pair_countsNE.shape
NE=nx.from_pandas_edgelist(pair_countsNE, 'ingred1', 'ingred2', edge_attr='percent')
weights=[edata['percent']*15 for f,t,edata in NE.edges(data=True)]
nx.draw_circular(NE, with_labels=True, node_color="steelblue", node_size=1000, font_size=12, font_weight='bold',
width=weights, edge_color="skyblue", alpha=0.5)
In Northern European cuisine, we see even heavier connections between wheat, butter, and eggs than we did in Eastern European or North American cooking. Yeast appears here too, which may indicate baked goods like breads, but we also see vanilla and cane molasses as well as apples and oranges, which may indicate sweets.
Beef, milk, eggs, cream, butter, and lard are the only animal products.
category = "NorthernEuropean"
fdist, ingredients_sorted = setup(category)
cuisine(category, ingredients_sorted)
half(category, ingredients_sorted, fdist)
pd.DataFrame(fdist.most_common()[:13], columns=['ingredient', 'count'])
plt.figure(figsize=(15,4))
fdist.plot(13, cumulative = True);
plt.figure(figsize=(17,5))
fdist.plot(50, );
pair_countsSA=pair_counts[(pair_counts['counts']>62) & (pair_counts['Cuisine']=='SouthAsian')]
pair_countsSA.shape
SA=nx.from_pandas_edgelist(pair_countsSA, 'ingred1', 'ingred2', edge_attr='percent')
weights=[edata['percent']*15 for f,t,edata in SA.edges(data=True)]
nx.draw_circular(SA, with_labels=True, node_color="cornflowerblue", node_size=1000, font_size=12, font_weight='bold',
width=weights, edge_color="lightsteelblue", alpha=0.5)
Similar to African cooking we see spices are central to South Asian cuisine including: onion, garlic, cumin, turmeric, cilantro, coriander, cayenne, ginger, cinnamon, fenugreek, pepper, and black pepper.
Chicken, yogurt, and butter are the only animal-based foods in the graph.
category = "SouthAsian"
fdist, ingredients_sorted = setup(category)
cuisine(category, ingredients_sorted)
half(category, ingredients_sorted, fdist)
pd.DataFrame(fdist.most_common()[:13], columns=['ingredient', 'count'])
plt.figure(figsize=(15,4))
fdist.plot(13, cumulative = True);
plt.figure(figsize=(17,5))
fdist.plot(50, );
pair_countsSEA=pair_counts[(pair_counts['counts']>45) & (pair_counts['Cuisine']=='SoutheastAsian')]
pair_countsSEA.shape
SEA=nx.from_pandas_edgelist(pair_countsSEA, 'ingred1', 'ingred2', edge_attr='percent')
weights=[edata['percent']*15 for f,t,edata in SEA.edges(data=True)]
nx.draw_circular(SEA, with_labels=True, node_color="slateblue", node_size=1000, font_size=12, font_weight='bold',
width=weights, edge_color="plum", alpha=0.5)
Southeast Asian cuisine is the first we've seen that has a meat, specifically fish, as the most prominent ingredient in the graph. Fish is connected to most of the other ingredients. The only other ingredient that may have more connections is garlic.
Fish, shrimp, and chicken are the only animal-based ingredients in the graph.
category = "SoutheastAsian"
fdist, ingredients_sorted = setup(category)
cuisine(category, ingredients_sorted)
half(category, ingredients_sorted, fdist)
pd.DataFrame(fdist.most_common()[:17], columns=['ingredient', 'count'])
plt.figure(figsize=(15,4))
fdist.plot(17, cumulative = True);
plt.figure(figsize=(17,5))
fdist.plot(50, );
pair_countsSE=pair_counts[(pair_counts['counts']>230) & (pair_counts['Cuisine']=='SouthernEuropean')]
pair_countsSE.shape
SE=nx.from_pandas_edgelist(pair_countsSE, 'ingred1', 'ingred2', edge_attr='percent')
weights=[edata['percent']*15 for f,t,edata in SE.edges(data=True)]
nx.draw_circular(SE, with_labels=True, node_color="mediumvioletred", node_size=1000, font_size=12, font_weight='bold',
width=weights, edge_color="palevioletred", alpha=0.5)
In Southern European cuisine, olive oil seems to be in everything! It dominates the graph creating a fanlike appearance and has a very heavy connection to garlic and onions, and a slightly less heavy connection to tomatoes.
Besides Latin American cuisine, this is the only cuisine whose graph includes cheeses: cheese, parmesan cheese, and mozzarella cheese. Other animal products include eggs, butter, chicken broth, and milk.
category = "SouthernEuropean"
fdist, ingredients_sorted = setup(category)
cuisine(category, ingredients_sorted)
half(category, ingredients_sorted, fdist)
pd.DataFrame(fdist.most_common()[:16], columns=['ingredient', 'count'])
plt.figure(figsize=(15,4))
fdist.plot(16, cumulative = True);
plt.figure(figsize=(17,5))
fdist.plot(50, );
pair_countsWE=pair_counts[(pair_counts['counts']>100) & (pair_counts['Cuisine']=='WesternEuropean')]
pair_countsWE.shape
WE=nx.from_pandas_edgelist(pair_countsWE, 'ingred1', 'ingred2', edge_attr='percent')
weights=[edata['percent']*15 for f,t,edata in WE.edges(data=True)]
nx.draw_circular(WE, with_labels=True, node_color="firebrick", node_size=1000, font_size=12, font_weight='bold',
width=weights, edge_color="indianred", alpha=0.5)
Like Eastern European and North American cooking, Western European cuisine has wheat, butter, eggs and milk front and center. Yeast is almost absent here, however, with just one connection to wheat, so breads may still be part of the baking. We also see vanilla, cocoa, cinnamon, nutmeg, raisins, and cane molasses, which may indicate that sweets are more common in Western European cuisine than in Eastern European or North American.
Milk, eggs, cream, butter, lard, and milk fat are the only animal products.
category = "WesternEuropean"
fdist, ingredients_sorted = setup(category)
cuisine(category, ingredients_sorted)
half(category, ingredients_sorted, fdist)
pd.DataFrame(fdist.most_common()[:19], columns=['ingredient', 'count'])
plt.figure(figsize=(15,7))
fdist.plot(19, cumulative = True);
plt.figure(figsize=(17,5))
fdist.plot(50, );
In order to analyze the cuisines and ingredients as a bipartite network that isn't dominated by the overabundance of North American recipes in our dataset, we took the proportional data we calculated earlier and multiplied the percentages by 10. Because those percentages are shares of ingredient listings, the result (stored in `countPer1000recipes`) estimates how many of every 1,000 ingredient listings in a cuisine are that ingredient. We then took the top 50 ingredients from each cuisine and created a graph object from those, with cuisines and ingredients as the top and bottom nodes.
network = cuisine_props.groupby('Cuisine').head(50).reset_index()
network['countPer1000recipes'] = round(network['percent']*10, 0)
network.head()
B = nx.Graph()
B.add_nodes_from(network['Cuisine'], bipartite=0)
B.add_nodes_from(network['ingredient'], bipartite=1)
B.add_weighted_edges_from([tuple(d) for d in network[['Cuisine','ingredient','countPer1000recipes']].values])
print(B)  # nx.info was removed in networkx 3.0; printing the graph reports node and edge counts
cuisine_nodes = {n for n, d in B.nodes(data=True) if d['bipartite']==0}
ingredient_nodes = set(B) - cuisine_nodes
print(bi.density(B, cuisine_nodes))
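For an undirected bipartite graph, the density reported by `bipartite.density` is the number of edges divided by the number of possible cuisine-ingredient edges, that is, edges / (cuisine nodes × ingredient nodes). The toy graph below is made up for illustration and checks the library value against that hand calculation.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy bipartite graph: 2 cuisine nodes, 3 ingredient nodes, 4 edges
toy = nx.Graph()
toy.add_nodes_from(["C1", "C2"], bipartite=0)
toy.add_nodes_from(["onion", "garlic", "wheat"], bipartite=1)
toy.add_edges_from([("C1", "onion"), ("C1", "garlic"),
                    ("C2", "garlic"), ("C2", "wheat")])

top = {n for n, d in toy.nodes(data=True) if d["bipartite"] == 0}

# 4 edges out of 2 * 3 = 6 possible cuisine-ingredient edges
by_hand = toy.number_of_edges() / (len(top) * (len(toy) - len(top)))
from_library = bipartite.density(toy, top)
print(by_hand, from_library)
```

Because each cuisine contributes exactly its top 50 ingredients, the density of `B` mainly reflects how much those top-50 lists overlap across cuisines.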
The functions below use the edge weights, the per-1,000 figures computed above, to trim the graph into subgraphs at different minimum weight levels. Any node not connected by at least one edge above the minimum weight is cut from the graph.
def trim_edges(g, weight=1):
    # Keep only edges heavier than the threshold; nodes with no surviving edge are dropped
    g2=nx.Graph()
    my_list=[]
    my_list1=[]
    for f, to, edata in g.edges(data=True):
        if edata['weight'] > weight:
            my_list.append(f)
            my_list1.append(to)
            # stored as an edge attribute named 'attr_dict' holding {threshold: weight}
            g2.add_edge(f,to,attr_dict={weight:edata['weight']})
    g2.add_nodes_from(my_list, bipartite=0)
    g2.add_nodes_from(my_list1, bipartite=1)
    return g2

def island_method(g, iterations=5, weight=1):
    weights = [edata['weight'] for f, to, edata in g.edges(data=True)]
    mn=int(min(weights)) if int(min(weights)) > weight else weight
    mx=int(max(weights))
    #compute the size of the step, so we get a reasonable step in iterations
    step=int((mx-mn)/(iterations-1))
    return [[threshold, trim_edges(g, threshold)] for threshold in range(mn,mx,step)]
islands = island_method(B, 8, 15)
print('min weight - ', '# of nodes - ', '# of island subgraphs')
for i in islands:
    # print the threshold level, size of the graph, and number of connected components
    # (connected_component_subgraphs was removed in networkx 2.4)
    print(i[0], ' - ', len(i[1]), ' - ', nx.number_connected_components(i[1]))
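To make the effect of the threshold concrete, here is a small sketch of the same idea on a toy weighted graph. `trim_by_weight` is a simplified stand-in written for this example, not the notebook's `trim_edges`: it keeps the weight under the usual `weight` key and does not set the `bipartite` attribute.

```python
import networkx as nx

def trim_by_weight(g, threshold):
    """Return a new graph keeping only edges whose weight exceeds the threshold."""
    kept = [(u, v, d) for u, v, d in g.edges(data=True) if d["weight"] > threshold]
    trimmed = nx.Graph()
    trimmed.add_edges_from(kept)  # nodes without a surviving edge are never added
    return trimmed

# Made-up weights standing in for the per-1,000 figures
toy = nx.Graph()
toy.add_weighted_edges_from([
    ("C1", "onion", 60), ("C1", "garlic", 30),
    ("C2", "garlic", 45), ("C2", "wheat", 20),
])

# threshold, number of nodes, number of connected components
for threshold in (15, 25, 40):
    t = trim_by_weight(toy, threshold)
    print(threshold, t.number_of_nodes(), nx.number_connected_components(t))
```

Raising the threshold first drops the weakly attached ingredient (wheat) and then splits the graph into two islands once the garlic link from C1 falls below the cut.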
def set_colors(G):
    colors = []
    for node, data in G.nodes(data=True):
        if data['bipartite'] == 1:
            colors.append('wheat') # Ingredients in yellow
        else:
            colors.append('darkkhaki') # Cuisines in Green
    return colors
# Largest connected component (connected_component_subgraphs was removed in networkx 2.4)
G0=islands[0][1].subgraph(max(nx.connected_components(islands[0][1]), key=len))
plt.rcParams["figure.figsize"] = (15,15) # set plot size
colors = set_colors(G0) # set colors
weights = [B[f][t]['weight']/10 for f, t in G0.edges()] # edge widths from the original weights in B
nx.draw(G0, with_labels=True, node_color=colors, node_size=2000, width=weights,
font_size=10, font_weight='bold', edge_color="wheat")
In the graph above we can already see the similarities and differences among cuisines start to take shape. East Asian, while still connected to the rest of the network by a large number of ingredients, also has a number of ingredients that appear only for East Asian cuisine: coconut, soybeans, sake, roasted sesame seeds, sesame oil, carrots. The same can be said for Southeast Asian, with lime and lime juice. Both East Asian and Southeast Asian share an affinity for shrimp, fish, scallions and soy sauce. Ingredients unique to Southern European cooking include Parmesan cheese, macaroni, and basil. South Asian includes yogurt and fenugreek, which are not connected to any other cuisine.
We can also see the centrality of ingredients like onion, garlic, black pepper, bell peppers, tomatoes, olive oil, wheat, eggs, butter, cayenne, and cumin to a large number of cuisines.
G1=islands[1][1]
plt.rcParams["figure.figsize"] = (12,12) # set plot size
colors = set_colors(G1) # set colors
weights = [B[f][t]['weight']/10 for f, t in G1.edges()] # edge widths for G1, taken from the original weights in B
nx.draw(G1, with_labels=True, node_color=colors, node_size=2000, width=weights,
font_size=10, font_weight='bold', edge_color="wheat")
Some of the same relationships are even clearer in the second graph than they were in the first, but now we can also see almonds standing out as a uniquely Northern European ingredient.
G2=islands[2][1]
plt.rcParams["figure.figsize"] = (12,10) # set plot size
colors = set_colors(G2) # set colors
weights = [B[f][t]['weight']/10 for f, t in G2.edges()] # edge widths for G2, taken from the original weights in B
nx.draw(G2, with_labels=True, node_color=colors, node_size=2000, width=weights,
font_size=10, font_weight='bold', edge_color="wheat")
The third graph really breaks down each cuisine into its most common ingredients, with each cuisine connected to the graph by no more than 8 ingredients. This image gives us an idea of the overall difference in flavors among cuisines from different cultures.
We can also see onion and garlic stand out as the two ingredients central to the largest number of cuisines with 6 and 7 connections respectively.
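Connection counts like these can be read off the graph programmatically as node degrees rather than from the picture. The sketch below ranks ingredient nodes by degree on a made-up bipartite graph; with the notebook's objects, the same pattern would presumably be applied to `G2`.

```python
import networkx as nx

# Toy bipartite graph: cuisines tagged bipartite=0, ingredients bipartite=1
toy = nx.Graph()
toy.add_nodes_from(["C1", "C2", "C3"], bipartite=0)
toy.add_nodes_from(["onion", "garlic", "fish"], bipartite=1)
toy.add_edges_from([("C1", "onion"), ("C2", "onion"), ("C1", "garlic"),
                    ("C2", "garlic"), ("C3", "garlic"), ("C3", "fish")])

# Degree of an ingredient node = number of cuisines it is connected to
ingredient_degree = {n: deg for n, deg in toy.degree()
                     if toy.nodes[n]["bipartite"] == 1}
ranked = sorted(ingredient_degree.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('garlic', 3), ('onion', 2), ('fish', 1)]
```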
As expected we can see very clear similarities and differences among the cuisines of different cultures.
There are more similarities between the cuisines of cultures that share continents in general. For example, Asian cuisines show more similarities to each other than they do to North American and European cuisines. North American and Northern, Western and Eastern European cuisines show many similarities, while Southern European shows more similarity to Latin American, likely due to the differing cultural heritage of North American vs. Latin American settlers. Southern European, Latin American and Asian cuisines emphasize more savory and spicy flavors, while North American and European cuisines (other than Southern European) have more buttery, creamy, carb-based flavors. Southeast Asian really stands out with the inclusion of fish, the only meat-based ingredient. South Asian shows the highest emphasis on spices, with all seven of its top ingredients being spices.
There's a noticeable (and notable) lack of sugar anywhere in the list of ingredients, which I find particularly surprising considering the prevalence of wheat (probably flour), eggs and butter as top ingredients in many cuisines. That is definitely an area that should be investigated in future analysis.